CLDR-17897 Fix unstable scripts when running GenerateLikelySubtags and ConvertLanguageData #3998
Force-pushed from a77d66d to 3c0661a.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot
… and likely subtag overrides
The generated files for ConvertLanguageData and GenerateLikelySubtags change if input files are modified. This change seeks to stabilize the scripts' outputs.
CLDR-17897 Add overrides to Likely Subtags
Force-pushed from 3c0661a to e5fa96d.
Force-pushed from da92bfe to 0756612.
@@ -703,10 +706,10 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="tiv" to="tiv_Latn_NG"/> <!--Tiv‧?‧? ➡ Tiv‧Latin‧Nigeria-->
 <likelySubtag from="tk" to="tk_Latn_TM"/> <!--Turkmen‧?‧? ➡ Turkmen‧Latin‧Turkmenistan-->
 <likelySubtag from="tkl" to="tkl_Latn_TK"/> <!--Tokelau‧?‧? ➡ Tokelau‧Latin‧Tokelau-->
-<likelySubtag from="tkr" to="tkr_Latn_AZ"/> <!--Tsakhur‧?‧? ➡ Tsakhur‧Latin‧Azerbaijan-->
+<likelySubtag from="tkr" to="tkr_Cyrl_AZ"/> <!--Tsakhur‧?‧? ➡ Tsakhur‧Cyrillic‧Azerbaijan-->
This doesn't make sense. Tsakhur is written in Latin in Azerbaijan and in Cyrillic in Russia. The old value was correct.
Thanks for the close examination -- I'll re-introduce the overrides for these languages. I'm having a problem fighting the different sources of truth :p Definitely Latn should be considered the primary script in Azerbaijan.
I think the root problem is that "Cyrl" comes before "Latn" alphabetically, and when the script is re-run it now takes the first item in alphabetical order.
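The failure mode described here can be sketched in a few lines. This is only a hypothetical illustration, not the code in GenerateLikelySubtags: the class `ScriptPick`, its two methods, and the user counts are all invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration; not the actual CLDR tooling.
class ScriptPick {
    // Problematic behaviour: sort the script codes and take the first one.
    static String firstAlphabetical(Map<String, Long> usersByScript) {
        return new TreeSet<>(usersByScript.keySet()).first();
    }

    // Intended behaviour: take the script with the largest user count.
    static String mostUsed(Map<String, Long> usersByScript) {
        return usersByScript.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Placeholder numbers, chosen only so that Latn dominates.
        Map<String, Long> tkrInAz = new LinkedHashMap<>();
        tkrInAz.put("Latn", 900L);
        tkrInAz.put("Cyrl", 100L);
        System.out.println(firstAlphabetical(tkrInAz)); // Cyrl
        System.out.println(mostUsed(tkrInAz));          // Latn
    }
}
```

With any data where Latn has more users than Cyrl, the alphabetical pick still returns Cyrl, which matches the regression seen for tkr and ttt.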
@@ -725,7 +728,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="tt" to="tt_Cyrl_RU"/> <!--Tatar‧?‧? ➡ Tatar‧Cyrillic‧Russia-->
 <likelySubtag from="ttj" to="ttj_Latn_UG"/> <!--Tooro‧?‧? ➡ Tooro‧Latin‧Uganda-->
 <likelySubtag from="tts" to="tts_Thai_TH"/> <!--Northeastern Thai‧?‧? ➡ Northeastern Thai‧Thai‧Thailand-->
-<likelySubtag from="ttt" to="ttt_Latn_AZ"/> <!--Muslim Tat‧?‧? ➡ Muslim Tat‧Latin‧Azerbaijan-->
+<likelySubtag from="ttt" to="ttt_Cyrl_AZ"/> <!--Muslim Tat‧?‧? ➡ Muslim Tat‧Cyrillic‧Azerbaijan-->
Same problem. Muslim Tat is written in Latin in Azerbaijan and in Cyrillic in Russia.
@@ -1036,6 +1039,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 <likelySubtag from="und_Ahom" to="aho_Ahom_IN"/> <!--?‧Ahom‧? ➡ Ahom‧Ahom‧India-->
 <likelySubtag from="und_Arab" to="ar_Arab_EG"/> <!--?‧Arabic‧? ➡ Arabic‧Arabic‧Egypt-->
 <likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/> <!--?‧Arabic‧Afghanistan ➡ Persian‧Arabic‧Afghanistan-->
+<likelySubtag from="und_Arab_AZ" to="tly_Arab_AZ"/> <!--?‧Arabic‧Azerbaijan ➡ Talysh‧Arabic‧Azerbaijan-->
If you have something unknown in Arabic script in Azerbaijan, it's probably not Talysh (which has a pretty small community compared to Azerbaijani, where they write in Latin). It's very probably Azerbaijani in the old orthography.
@@ -1890,7 +1890,7 @@ XXX Code for transations where no currency is involved
 <language type="lv" scripts="Latn" territories="LV"/>
 <language type="lwl" scripts="Thai"/>
 <language type="lzh" scripts="Hans" alt="secondary"/>
-<language type="lzz" scripts="Latn Geor"/>
+<language type="lzz" scripts="Geor Latn"/>
Is the ordering significant?
The code writes the scripts alphabetically, so when I regenerate the data it force-alphabetizes them. There is an argument that they should be ordered by usage; however, the XML is just not a good way to capture this because the labelling is unclear.
The scripts (and regions) should be in ranked order, not sorted. If the code is sorting them, that's a bug.
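One common way a ranked list ends up sorted in Java is collecting it into a `TreeSet`; a `LinkedHashSet` keeps insertion order instead. Whether ConvertLanguageData actually does this is an assumption, and the class `ScriptsAttribute` and its methods are hypothetical names used only for this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the actual CLDR tooling.
class ScriptsAttribute {
    // TreeSet re-orders the codes alphabetically, losing the ranking.
    static String joinSorted(List<String> rankedScripts) {
        Set<String> scripts = new TreeSet<>(rankedScripts);
        return String.join(" ", scripts);
    }

    // LinkedHashSet removes duplicates but keeps the ranked order.
    static String joinRanked(List<String> rankedScripts) {
        Set<String> scripts = new LinkedHashSet<>(rankedScripts);
        return String.join(" ", scripts);
    }

    public static void main(String[] args) {
        // Order taken from the old lzz line in the diff above.
        List<String> lzz = Arrays.asList("Latn", "Geor");
        System.out.println(joinSorted(lzz)); // Geor Latn
        System.out.println(joinRanked(lzz)); // Latn Geor
    }
}
```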
It looks like if Roozbeh's items are taken care of, then this would be ready to merge into 47.
I'm glad Roozbeh took a look ;) I resolved this by changing the non-Latn script for these languages to be considered "secondary" in language_script.tsv
Merging these changes ended up getting really messy, so I'll post a new pull request.
pnt Pontic secondary Cyrl Cyrillic
pnt Pontic secondary Latn Latin
What's the basis of making these secondary?
Promoting Grek to be the primary script for Pontic.
Really, for all current Pontic speakers it's Grek in Greece, Latn in Turkey, and Cyrl in Russia/Ukraine. Pontic is only spoken by very marginal populations in Turkey and Russia, but it's a large recognized community in Greece.
What's the basis for primary v secondary?
The primary vs secondary distinction should be based on the literate population sizes. I forget what the cutoff is, but clearly if >50% of the usage of the language is in a particular script, that would be primary, not secondary. (But again, there might be a bug in the code.)
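The rule of thumb in this comment can be sketched as a share-of-literate-population check. The real cutoff is not stated in the thread, so `ASSUMED_CUTOFF` below is an assumption taken from the ">50%" remark; the class `ScriptStatus` and the population figures are hypothetical placeholders, not CLDR data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the actual CLDR tooling.
class ScriptStatus {
    // Assumed cutoff: the thread only says that >50% is clearly primary.
    static final double ASSUMED_CUTOFF = 0.5;

    // Labels each script "primary" or "secondary" by its share of the total.
    static Map<String, String> classify(Map<String, Long> literateByScript) {
        long total = literateByScript.values().stream()
                .mapToLong(Long::longValue)
                .sum();
        Map<String, String> status = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : literateByScript.entrySet()) {
            double share = total == 0 ? 0.0 : (double) e.getValue() / total;
            status.put(e.getKey(), share > ASSUMED_CUTOFF ? "primary" : "secondary");
        }
        return status;
    }

    public static void main(String[] args) {
        // Placeholder numbers for Pontic, chosen only so that Grek dominates.
        Map<String, Long> pnt = new LinkedHashMap<>();
        pnt.put("Grek", 800L);
        pnt.put("Latn", 100L);
        pnt.put("Cyrl", 100L);
        System.out.println(classify(pnt));
    }
}
```

Under these placeholder figures Grek comes out primary and Latn and Cyrl secondary, which is the outcome the TSV change above encodes by hand.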
@@ -392,6 +393,9 @@ public static void main(String[] args) throws IOException {
 {"mro", "mro_Mroo_BD"},
 {"mro_BD", "mro_Mroo_BD"},
 {"ms_Arab", "ms_Arab_MY"},
+{"nan", "nan_Hans_CN"},
+{"nan_Hans", "nan_Hans_CN"},
This won't hurt anything, but nan_Hans is redundant, because the algorithm will find {"nan", "nan_Hans_CN"}, and fill in.
There is a ticket open for dropping overrides that have no effect, so it is ok to keep this line for now.
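The fill-in behaviour this comment expects can be sketched as a two-step lookup: try the tag as given, then fall back to the bare language and keep any fields the input already supplied. This is a heavily simplified, hypothetical model (class `LikelyLookup`, method `maximize`), handling only `lang` and `lang_Script` inputs; it is not the generator's actual logic and not the full likely-subtags algorithm.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified illustration; not the actual CLDR tooling.
class LikelyLookup {
    // Accepts "lang" or "lang_Script"; returns "lang_Script_Region" or null.
    static String maximize(String tag, Map<String, String> table) {
        String[] in = tag.split("_");
        String lang = in[0];
        String script = in.length > 1 ? in[1] : null;

        // Step 1: exact match. Step 2: fall back to the bare language.
        String match = table.get(tag);
        if (match == null) {
            match = table.get(lang);
        }
        if (match == null) {
            return null;
        }

        // Keep the script the caller supplied; fill in whatever is missing.
        String[] full = match.split("_");
        String outScript = script != null ? script : full[1];
        return lang + "_" + outScript + "_" + full[2];
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("nan", "nan_Hans_CN"); // only the bare-language override
        System.out.println(maximize("nan_Hans", table)); // nan_Hans_CN
    }
}
```

In this model the `nan_Hans` entry is indeed redundant, since the `nan` entry alone already yields nan_Hans_CN.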
Interestingly, I need to keep this line; otherwise the produced likelySubtags.xml file will show nan_Hans -> nan_Hans_TW, even though, as we know, the Hant script would be preferred in Taiwan. The problem is that we don't have population estimates on Simplified vs Traditional Chinese script usage.
Thanks everyone for the comments! They helped me make a better version of this PR in #4015. Apologies for making a separate one -- rebasing it to the ddl/v47 branch introduced weird merge artifacts, so I just made a new PR.
CLDR-17897
While we are improving the population data and likely subtags, we are generating side effects from partial data. This adds new scripts so we can avoid these side effects in future changes. Ultimately we will want to reduce the number of overrides here, but it's good to fix this.
See the data updated in this diagram:
Run this command to regenerate data:
mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags
ALLOW_MANY_COMMITS=true